1. Import Libraries¶2.Assess Data Properties¶3. Univariate Exploration¶4. Bivariate Exploration¶5. Multivariate Exploration¶6. References¶import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib Inline
%config InlineBackend.figure_format = "retina"
df_prospa = pd.read_csv('/Users/s.erudite/Documents/Python Scripts/Prospa Loan App Data Analysis/prosperLoanData.csv')
#A view of how the dataset looks like
df_prospa
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | ... | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113932 | E6D9357655724827169606C | 753087 | 2013-04-14 05:55:02.663000000 | NaN | 36 | Current | NaN | 0.22354 | 0.1864 | 0.1764 | ... | -75.58 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 113933 | E6DB353036033497292EE43 | 537216 | 2011-11-03 20:42:55.333000000 | NaN | 36 | FinalPaymentInProgress | NaN | 0.13220 | 0.1110 | 0.1010 | ... | -30.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 22 |
| 113934 | E6E13596170052029692BB1 | 1069178 | 2013-12-13 05:49:12.703000000 | NaN | 60 | Current | NaN | 0.23984 | 0.2150 | 0.2050 | ... | -16.91 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 119 |
| 113935 | E6EB3531504622671970D9E | 539056 | 2011-11-14 13:18:26.597000000 | NaN | 60 | Completed | 2013-08-13 00:00:00 | 0.28408 | 0.2605 | 0.2505 | ... | -235.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 274 |
| 113936 | E6ED3600409833199F711B7 | 1140093 | 2014-01-15 09:27:37.657000000 | NaN | 36 | Current | NaN | 0.13189 | 0.1039 | 0.0939 | ... | -1.70 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
113937 rows × 81 columns
#Statistical description of the data
df_prospa.describe()
| ListingNumber | Term | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperScore | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.139370e+05 | 113937.000000 | 113912.000000 | 113937.000000 | 113937.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 84853.000000 | ... | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 |
| mean | 6.278857e+05 | 40.830248 | 0.218828 | 0.192764 | 0.182701 | 0.168661 | 0.080306 | 0.096068 | 4.072243 | 5.950067 | ... | -54.725641 | -14.242698 | 700.446342 | 681.420499 | 25.142686 | 0.998584 | 0.048027 | 0.023460 | 16.550751 | 80.475228 |
| std | 3.280762e+05 | 10.436212 | 0.080364 | 0.074818 | 0.074516 | 0.068467 | 0.046764 | 0.030403 | 1.673227 | 2.376501 | ... | 60.675425 | 109.232758 | 2388.513831 | 2357.167068 | 275.657937 | 0.017919 | 0.332353 | 0.232412 | 294.545422 | 103.239020 |
| min | 4.000000e+00 | 12.000000 | 0.006530 | 0.000000 | -0.010000 | -0.182700 | 0.004900 | -0.182700 | 1.000000 | 1.000000 | ... | -664.870000 | -9274.750000 | -94.200000 | -954.550000 | 0.000000 | 0.700000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 4.009190e+05 | 36.000000 | 0.156290 | 0.134000 | 0.124200 | 0.115670 | 0.042400 | 0.074080 | 3.000000 | 4.000000 | ... | -73.180000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 50% | 6.005540e+05 | 36.000000 | 0.209760 | 0.184000 | 0.173000 | 0.161500 | 0.072400 | 0.091700 | 4.000000 | 6.000000 | ... | -34.440000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 44.000000 |
| 75% | 8.926340e+05 | 36.000000 | 0.283810 | 0.250000 | 0.240000 | 0.224300 | 0.112000 | 0.116600 | 5.000000 | 8.000000 | ... | -13.920000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 115.000000 |
| max | 1.255725e+06 | 60.000000 | 0.512290 | 0.497500 | 0.492500 | 0.319900 | 0.366000 | 0.283700 | 7.000000 | 11.000000 | ... | 32.060000 | 0.000000 | 25000.000000 | 25000.000000 | 21117.900000 | 1.012500 | 39.000000 | 33.000000 | 25000.000000 | 1189.000000 |
8 rows × 61 columns
#What datatypes does it contain?
df_prospa.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 ListingNumber 113937 non-null int64 2 ListingCreationDate 113937 non-null object 3 CreditGrade 28953 non-null object 4 Term 113937 non-null int64 5 LoanStatus 113937 non-null object 6 ClosedDate 55089 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113937 non-null float64 9 LenderYield 113937 non-null float64 10 EstimatedEffectiveYield 84853 non-null float64 11 EstimatedLoss 84853 non-null float64 12 EstimatedReturn 84853 non-null float64 13 ProsperRating (numeric) 84853 non-null float64 14 ProsperRating (Alpha) 84853 non-null object 15 ProsperScore 84853 non-null float64 16 ListingCategory (numeric) 113937 non-null int64 17 BorrowerState 108422 non-null object 18 Occupation 110349 non-null object 19 EmploymentStatus 111682 non-null object 20 EmploymentStatusDuration 106312 non-null float64 21 IsBorrowerHomeowner 113937 non-null bool 22 CurrentlyInGroup 113937 non-null bool 23 GroupKey 13341 non-null object 24 DateCreditPulled 113937 non-null object 25 CreditScoreRangeLower 113346 non-null float64 26 CreditScoreRangeUpper 113346 non-null float64 27 FirstRecordedCreditLine 113240 non-null object 28 CurrentCreditLines 106333 non-null float64 29 OpenCreditLines 106333 non-null float64 30 TotalCreditLinespast7years 113240 non-null float64 31 OpenRevolvingAccounts 113937 non-null int64 32 OpenRevolvingMonthlyPayment 113937 non-null float64 33 InquiriesLast6Months 113240 non-null float64 34 TotalInquiries 112778 non-null float64 35 CurrentDelinquencies 113240 non-null float64 36 AmountDelinquent 106315 non-null float64 37 DelinquenciesLast7Years 112947 non-null float64 38 PublicRecordsLast10Years 113240 non-null float64 39 PublicRecordsLast12Months 106333 non-null float64 40 RevolvingCreditBalance 106333 non-null float64 41 BankcardUtilization 106333 non-null float64 42 AvailableBankcardCredit 106393 non-null float64 43 TotalTrades 106393 non-null float64 44 TradesNeverDelinquent (percentage) 106393 non-null float64 45 TradesOpenedLast6Months 106393 non-null float64 46 DebtToIncomeRatio 105383 non-null float64 47 IncomeRange 113937 non-null object 48 IncomeVerifiable 113937 non-null bool 49 StatedMonthlyIncome 113937 non-null float64 50 LoanKey 113937 non-null object 51 TotalProsperLoans 22085 non-null float64 52 TotalProsperPaymentsBilled 22085 non-null float64 53 OnTimeProsperPayments 22085 non-null float64 54 ProsperPaymentsLessThanOneMonthLate 22085 non-null float64 55 ProsperPaymentsOneMonthPlusLate 22085 non-null float64 56 ProsperPrincipalBorrowed 22085 non-null float64 57 ProsperPrincipalOutstanding 22085 non-null float64 58 ScorexChangeAtTimeOfListing 18928 non-null float64 59 LoanCurrentDaysDelinquent 113937 non-null int64 60 LoanFirstDefaultedCycleNumber 16952 non-null float64 61 LoanMonthsSinceOrigination 113937 non-null int64 62 LoanNumber 113937 non-null int64 63 LoanOriginalAmount 113937 non-null int64 64 LoanOriginationDate 113937 non-null object 65 LoanOriginationQuarter 113937 non-null object 66 MemberKey 113937 non-null object 67 MonthlyLoanPayment 113937 non-null float64 68 LP_CustomerPayments 113937 non-null float64 69 LP_CustomerPrincipalPayments 113937 non-null float64 70 LP_InterestandFees 113937 non-null float64 71 LP_ServiceFees 113937 non-null float64 72 LP_CollectionFees 113937 non-null float64 73 LP_GrossPrincipalLoss 113937 non-null float64 74 LP_NetPrincipalLoss 113937 non-null float64 75 LP_NonPrincipalRecoverypayments 113937 non-null float64 76 PercentFunded 113937 non-null float64 77 Recommendations 113937 non-null int64 78 InvestmentFromFriendsCount 113937 non-null int64 79 InvestmentFromFriendsAmount 113937 non-null float64 80 Investors 113937 non-null int64 dtypes: bool(3), float64(50), int64(11), object(17) memory usage: 68.1+ MB
#Let's return the number of rows and columns in the dataset
df_prospa.shape
(113937, 81)
df_prospa['Occupation'].value_counts()
Other 28617
Professional 13628
Computer Programmer 4478
Executive 4311
Teacher 3759
...
Dentist 68
Student - College Freshman 41
Student - Community College 28
Judge 22
Student - Technical School 16
Name: Occupation, Length: 67, dtype: int64
new_col = ['ListingNumber', 'ListingCreationDate', 'LoanOriginalAmount', 'BorrowerRate', 'LoanStatus', 'BorrowerAPR',
'StatedMonthlyIncome', 'Term', 'ProsperRating (Alpha)', 'EmploymentStatus', 'Occupation',
'BorrowerState', 'DebtToIncomeRatio', 'EstimatedReturn', 'Investors']
new_prospa = df_prospa[new_col]
new_prospa.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingNumber 113937 non-null int64 1 ListingCreationDate 113937 non-null object 2 LoanOriginalAmount 113937 non-null int64 3 BorrowerRate 113937 non-null float64 4 LoanStatus 113937 non-null object 5 BorrowerAPR 113912 non-null float64 6 StatedMonthlyIncome 113937 non-null float64 7 Term 113937 non-null int64 8 ProsperRating (Alpha) 84853 non-null object 9 EmploymentStatus 111682 non-null object 10 Occupation 110349 non-null object 11 BorrowerState 108422 non-null object 12 DebtToIncomeRatio 105383 non-null float64 13 EstimatedReturn 84853 non-null float64 14 Investors 113937 non-null int64 dtypes: float64(5), int64(4), object(6) memory usage: 13.0+ MB
new_prospa.head()
| ListingNumber | LoanOriginalAmount | BorrowerRate | LoanStatus | BorrowerAPR | StatedMonthlyIncome | Term | ProsperRating (Alpha) | EmploymentStatus | Occupation | BorrowerState | DebtToIncomeRatio | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 193129 | 9425 | 0.1580 | Completed | 0.16516 | 3083.333333 | 36 | NaN | Self-employed | Other | CO | 0.17 | 258 |
| 1 | 1209647 | 10000 | 0.0920 | Current | 0.12016 | 6125.000000 | 36 | A | Employed | Professional | CO | 0.18 | 1 |
| 2 | 81716 | 3001 | 0.2750 | Completed | 0.28269 | 2083.333333 | 36 | NaN | Not available | Other | GA | 0.06 | 41 |
| 3 | 658116 | 10000 | 0.0974 | Current | 0.12528 | 2875.000000 | 36 | A | Employed | Skilled Labor | GA | 0.15 | 158 |
| 4 | 909464 | 15000 | 0.2085 | Current | 0.24614 | 9583.333333 | 36 | D | Employed | Executive | MN | 0.26 | 20 |
sub_prospa = new_prospa.copy()
sub_prospa.EmploymentStatus.isnull().sum()
2255
# check for null values in Occupation column.
sub_prospa.Occupation.isnull().sum()
3588
sub_prospa.describe()
| ListingNumber | LoanOriginalAmount | BorrowerRate | BorrowerAPR | StatedMonthlyIncome | Term | DebtToIncomeRatio | EstimatedReturn | Investors | |
|---|---|---|---|---|---|---|---|---|---|
| count | 1.139370e+05 | 113937.00000 | 113937.000000 | 113912.000000 | 1.139370e+05 | 113937.000000 | 105383.000000 | 84853.000000 | 113937.000000 |
| mean | 6.278857e+05 | 8337.01385 | 0.192764 | 0.218828 | 5.608026e+03 | 40.830248 | 0.275947 | 0.096068 | 80.475228 |
| std | 3.280762e+05 | 6245.80058 | 0.074818 | 0.080364 | 7.478497e+03 | 10.436212 | 0.551759 | 0.030403 | 103.239020 |
| min | 4.000000e+00 | 1000.00000 | 0.000000 | 0.006530 | 0.000000e+00 | 12.000000 | 0.000000 | -0.182700 | 1.000000 |
| 25% | 4.009190e+05 | 4000.00000 | 0.134000 | 0.156290 | 3.200333e+03 | 36.000000 | 0.140000 | 0.074080 | 2.000000 |
| 50% | 6.005540e+05 | 6500.00000 | 0.184000 | 0.209760 | 4.666667e+03 | 36.000000 | 0.220000 | 0.091700 | 44.000000 |
| 75% | 8.926340e+05 | 12000.00000 | 0.250000 | 0.283810 | 6.825000e+03 | 36.000000 | 0.320000 | 0.116600 | 115.000000 |
| max | 1.255725e+06 | 35000.00000 | 0.497500 | 0.512290 | 1.750003e+06 | 60.000000 | 10.010000 | 0.283700 | 1189.000000 |
#Find the NA, NULL, and NaT rows in the dataset
sub_prospa.isnull().sum()
ListingNumber 0 ListingCreationDate 0 LoanOriginalAmount 0 BorrowerRate 0 LoanStatus 0 BorrowerAPR 25 StatedMonthlyIncome 0 Term 0 ProsperRating (Alpha) 29084 EmploymentStatus 2255 Occupation 3588 BorrowerState 5515 DebtToIncomeRatio 8554 EstimatedReturn 29084 Investors 0 dtype: int64
#Drop the NA rows in from the dataset
sub_prospa = sub_prospa.dropna()
#Confirm that no NA rows in the dataset
sub_prospa.isnull().sum()
ListingNumber 0 ListingCreationDate 0 LoanOriginalAmount 0 BorrowerRate 0 LoanStatus 0 BorrowerAPR 0 StatedMonthlyIncome 0 Term 0 ProsperRating (Alpha) 0 EmploymentStatus 0 Occupation 0 BorrowerState 0 DebtToIncomeRatio 0 EstimatedReturn 0 Investors 0 dtype: int64
sub_prospa.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 76224 entries, 1 to 113936 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingNumber 76224 non-null int64 1 ListingCreationDate 76224 non-null object 2 LoanOriginalAmount 76224 non-null int64 3 BorrowerRate 76224 non-null float64 4 LoanStatus 76224 non-null object 5 BorrowerAPR 76224 non-null float64 6 StatedMonthlyIncome 76224 non-null float64 7 Term 76224 non-null int64 8 ProsperRating (Alpha) 76224 non-null object 9 EmploymentStatus 76224 non-null object 10 Occupation 76224 non-null object 11 BorrowerState 76224 non-null object 12 DebtToIncomeRatio 76224 non-null float64 13 EstimatedReturn 76224 non-null float64 14 Investors 76224 non-null int64 dtypes: float64(5), int64(4), object(6) memory usage: 9.3+ MB
sub_prospa.head(10)
| ListingNumber | ListingCreationDate | LoanOriginalAmount | BorrowerRate | LoanStatus | BorrowerAPR | StatedMonthlyIncome | Term | ProsperRating (Alpha) | EmploymentStatus | Occupation | BorrowerState | DebtToIncomeRatio | EstimatedReturn | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1209647 | 2014-02-27 08:28:07.900000000 | 10000 | 0.0920 | Current | 0.12016 | 6125.000000 | 36 | A | Employed | Professional | CO | 0.18 | 0.05470 | 1 |
| 3 | 658116 | 2012-10-22 11:02:35.010000000 | 10000 | 0.0974 | Current | 0.12528 | 2875.000000 | 36 | A | Employed | Skilled Labor | GA | 0.15 | 0.06000 | 158 |
| 4 | 909464 | 2013-09-14 18:38:39.097000000 | 15000 | 0.2085 | Current | 0.24614 | 9583.333333 | 36 | D | Employed | Executive | MN | 0.26 | 0.09066 | 20 |
| 5 | 1074836 | 2013-12-14 08:26:37.093000000 | 15000 | 0.1314 | Current | 0.15425 | 8333.333333 | 60 | B | Employed | Professional | NM | 0.36 | 0.07077 | 1 |
| 6 | 750899 | 2013-04-12 09:52:56.147000000 | 3000 | 0.2712 | Current | 0.31032 | 2083.333333 | 36 | E | Employed | Sales - Retail | KS | 0.27 | 0.11070 | 1 |
| 7 | 768193 | 2013-05-05 06:49:27.493000000 | 10000 | 0.2019 | Current | 0.23939 | 3355.750000 | 36 | C | Employed | Laborer | CA | 0.24 | 0.09840 | 1 |
| 8 | 1023355 | 2013-12-02 10:43:39.117000000 | 10000 | 0.0629 | Current | 0.07620 | 3333.333333 | 36 | AA | Employed | Food Service | IL | 0.25 | 0.04231 | 1 |
| 9 | 1023355 | 2013-12-02 10:43:39.117000000 | 10000 | 0.0629 | Current | 0.07620 | 3333.333333 | 36 | AA | Employed | Food Service | IL | 0.25 | 0.04231 | 1 |
| 10 | 587746 | 2012-05-10 07:04:01.577000000 | 13500 | 0.2489 | Current | 0.27462 | 7500.000000 | 60 | C | Employed | Fireman | MD | 0.12 | 0.14420 | 19 |
| 12 | 1081604 | 2013-12-15 20:01:10.757000000 | 4000 | 0.1435 | Past Due (1-15 days) | 0.17969 | 2416.666667 | 36 | B | Employed | Sales - Retail | AL | 0.18 | 0.07400 | 1 |
# plt.rcParams['font.family'] = 'sans-serif'
# plt.rcParams['font.sans-serif'] = ['Tahoma']
font = {'family' : 'Tahoma',
'weight' : 'bold',
'size' : 10}
plt.rc('font', **font)
This cleaned data set includes details on 76224 loans for 14 different variables. The majority of the variables are numerical, while Loan Status is a nominal category variable.
Original Loan Amount, Borrower Annual Percentage Rate (BorrowerAPR) and BorrowerRate. Predict what factors affect them.
I predict that Monthly paycheck size of the borrower, the original amount of the loan requested, Employment status, and the kind of Occupation will affect the features of interest.
print(sub_prospa.BorrowerRate.min())
print()
print(sub_prospa.BorrowerRate.max())
0.04 0.36
#Define y axis label function that will be used in many charts in this univariate exploration
def disfreq():
plt.ylabel('Distribution Frequency', fontsize=15)
#Distribution of BorrowerAPR
bins = np.arange(0, sub_prospa['BorrowerAPR'].max(),0.01)
plt.figure(figsize=[10,6])
plt.hist(data = sub_prospa, x = 'BorrowerAPR', bins = bins);
plt.xlabel('BorrowerAPR', fontsize=15)
disfreq()
plt.title('Univariate Exploration of Borrower Annual Percentage Rating', fontsize=15);
#Distribution of BorrowerAPR
bins = np.arange(0, sub_prospa['BorrowerRate'].max(),0.01)
plt.figure(figsize=[10,6])
plt.hist(data = sub_prospa, x = 'BorrowerRate', bins = bins);
plt.xlabel('BorrowerRate', fontsize=15)
disfreq()
plt.title('Univariate Exploration of Borrower Rate', fontsize=15);
The distribution of BorrowerAPR and BorrowerRate is multimodal in nature
What amount is most borrowed?
print(sub_prospa.LoanOriginalAmount.min());
sub_prospa.LoanOriginalAmount.max()
1000
35000
def fig():
bin_size = 500
plt.figure(figsize=[10,6])
fig_def()
bins = np.arange(1000, sub_prospa.LoanOriginalAmount.max()+bin_size, bin_size)
plt.hist(data = sub_prospa, x = 'LoanOriginalAmount', bins = bins);
plt.xlabel('Original loan amount ($)', fontsize=15)
disfreq()
plt.title('Distribution of Original Loan Amount', fontsize=15);
4k, 10k and 15k are the most borrowed amounts in Prosper loan app
#Distribution of Stated Monthly Income
fig_def()
bins = np.arange(1000, sub_prospa.LoanOriginalAmount.max()+bin_size, bin_size)
plt.hist(data = sub_prospa, x = 'StatedMonthlyIncome', bins = bins);
plt.xlabel('Stated Monthly Income ($)', fontsize=15)
disfreq()
plt.title('Distribution of Stated Monthly Income ($)', fontsize=15);
#Percentage of borrowers that earn 15k and below
((sub_prospa.StatedMonthlyIncome<=15000).sum()/float(sub_prospa.shape[0]))*100
97.47323677581863
# distribution of DebtToIncomeRatio
bins = np.arange(0,1, 0.01)
plt.figure(figsize=[8, 5])
plt.hist(data = sub_prospa, x = 'DebtToIncomeRatio', bins = bins);
plt.xlabel('Debt To Income Ratio', fontsize = 15)
disfreq()
plt.title('Distribution of Debt To Income Ratio', fontsize=15);
base_colour = sns.color_palette()[2]
sub_prospa.EmploymentStatus.value_counts()
Employed 65884 Full-time 7584 Other 2194 Retired 320 Part-time 199 Self-employed 42 Not employed 1 Name: EmploymentStatus, dtype: int64
#Visualise the count
plt.figure(figsize=[10,6])
sns.countplot(data = sub_prospa, y = 'EmploymentStatus', color = base_colour);
plt.xlabel('Count', fontsize=15)
plt.ylabel('Employment Status', fontsize=15)
plt.title('Value count of the Borrowers\' employment status', fontsize=15);
sub_prospa.Occupation.value_counts()
Other 18501
Professional 9917
Executive 3206
Computer Programmer 3038
Teacher 2777
...
Student - College Junior 19
Student - College Freshman 12
Student - College Sophomore 9
Student - Community College 5
Student - Technical School 1
Name: Occupation, Length: 67, dtype: int64
plt.figure(figsize=[10,15])
sns.countplot(data = sub_prospa, y = 'Occupation', color = base_colour);
plt.xlabel('Count', fontsize = 15)
plt.ylabel('Occupation', fontsize = 15)
plt.title('Value Count of the Occupation Type', fontsize=15)
Text(0.5, 1.0, 'Value Count of the Occupation Type')
The Borrower APR is slightly multimodal and the values are between 0.05 to 0.4
The distribution of stated monthly income is skewed to the right and most 97% of the borrowers earn below 15k per month
Another observation shows that most of the borrowers are employed.
To achieve the goal of my analysis on the sub-dataset, I dropped all null values since they weren't much enough as to negatively bias the final result.
Will higher loan amount attract lower BorrowerAPR? I predict it should, but don't bank on my assumption, let the data tell us graphically.
plt.figure(figsize = (12,6))
sns.regplot(data = sub_prospa, x = 'BorrowerAPR', y = 'LoanOriginalAmount', truncate=False, x_jitter = 0.1,scatter_kws = {'alpha':1/20});
plt.xlabel('Borrower APR', fontsize = 15)
plt.ylabel('Loan Original Amount', fontsize = 15)
plt.title('Relationship Between BorrowerAPR and Loan Original Amount', fontsize=15);
We observed a negative correlation between Loan Original Amount and Borrower APR, that means as I earlier predicted higher loan amounts had lower borrower annual percentage return
plt.figure(figsize = [24, 12])
sns.boxplot(data = sub_prospa, x = 'LoanStatus', y= 'LoanOriginalAmount', color = base_colour)
plt.xticks(rotation=90)
plt.xlabel("Status",fontsize = 20);
plt.ylabel("Loan Original Amount",fontsize = 20);
plt.title("Correlation Between Loan Status and Loan Original Amount",fontsize = 25);
How does employment status perform across different Prosper Rating
sub_prospa.EmploymentStatus.value_counts()
Employed 65884 Full-time 7584 Other 2194 Retired 320 Part-time 199 Self-employed 42 Not employed 1 Name: EmploymentStatus, dtype: int64
sub_prospa['ProsperRating (Alpha)'].value_counts()
C 16501 B 14379 A 13491 D 12631 E 8443 HR 5682 AA 5097 Name: ProsperRating (Alpha), dtype: int64
plt.figure(figsize = [16, 10])
plt.yticks(np.arange(0,17000,1000))
sns.countplot(data = sub_prospa, x = 'EmploymentStatus', hue = 'ProsperRating (Alpha)');
plt.legend(loc='upper center', title='ProsperRating (Alpha)')
plt.xlabel('Employment Status', fontsize = 15)
plt.ylabel('Value Count', fontsize = 15)
plt.title('Prosper Rating (Alpha) across EmploymentStatus', fontsize=15)
Text(0.5, 1.0, 'Prosper Rating (Alpha) across EmploymentStatus')
1. Employment Status does not have enough data for Part-time, Retired, Self-employed and Not employed to show its interaction with ProsperRating (Alpha)
2. Most of the employed borrowers were C-rated followed by B and A respectively. Less than 5000 borrowers had the highest rating of AA.
Does employment status influence the amount of loan requested
plt.figure(figsize=[12,8])
sns.boxplot(data = sub_prospa, x = 'LoanOriginalAmount', y= 'EmploymentStatus', palette = 'Reds')
plt.xlabel('Loan Original Amount', fontsize=15)
plt.ylabel('Employment Status', fontsize=15)
plt.title('Relationship Between Loan Original Amount & Employment Status', fontsize=15)
Text(0.5, 1.0, 'Relationship Between Loan Original Amount & Employment Status')
Borrows that have Employed, Self-employed and Others employment status borrow higher amount than part-time, retired,full time and not employed borrowers.
Do higher income earners borrow more money?
sns.set_style('white')
plt.figure(figsize = (12,6))
ax = sns.regplot(x="LoanOriginalAmount", y="StatedMonthlyIncome", data=sub_prospa)
plt.xlabel("LoanOriginalAmount",fontsize = 15);
plt.ylabel("StatedMonthlyIncome",fontsize = 15);
plt.title("Correlation Between Stated Monthly Income and Loan Original Amount",fontsize = 20)
plt.xticks(np.arange(0,35000,5000))
plt.yticks(np.arange(0,500000,50000));
There is no clear positive correlation between stated monthly income and loan original amount requested
Does any form of correlation exist between LoanOriginalAmount and BorrowerRate?
sns.set_style('white')
plt.figure(figsize = (12,6))
ax = sns.regplot(x="LoanOriginalAmount", y="BorrowerRate", data=sub_prospa)
plt.xlabel("LoanOriginalAmount",fontsize = 15);
plt.ylabel("BorrowerRate",fontsize = 15);
plt.title("Correlation Between BorrowerRate and Loan Original Amount",fontsize = 20)
Text(0.5, 1.0, 'Correlation Between BorrowerRate and Loan Original Amount')
There is a negative correlation between the LoanOriginAmount and BorrowerRate.
Obviously, as I had expected that interest rate should be lesser for higher loan amount, the trendline of the scatter plot shows that the negative correlation.
How does the BorrowerAPR compare to the loan Term?
sns.boxplot(data = sub_prospa, x = 'Term', y= 'BorrowerAPR', palette = 'Oranges')
plt.xlabel('Term', fontsize=15)
plt.ylabel('BorrwerAPR', fontsize=15);
plt.title('Relationship Between Term and BorrowerAPR', fontsize=15)
Text(0.5, 1.0, 'Relationship Between Term and BorrowerAPR')
36 months term loans have higher BorrowerAPR than 12 or 60 months term.
Does loan term influence the BorrowerRate? I am assuming that shorter loans should attract higher interest rate.
sns.boxplot(data = sub_prospa, x = 'Term', y= 'BorrowerRate', palette = 'Oranges')
plt.xlabel('Term', fontsize=15)
plt.ylabel('Borrwer Rate', fontsize=15);
plt.title('Relationship Between Term and Borrower Rate', fontsize=15)
Text(0.5, 1.0, 'Relationship Between Term and Borrower Rate')
75% of 12 months term loans have interest rate below 20%.
This was a bit surprising for me. I had envisaged that shorter term loans should come with higher interest but obviously the data said something else. Longer term loans attracted higher interest.
I assume the management of Prosper Loan App sdopted this strategy to encourage faster loan repayment.
My key interest here is to investigate how the relationship between LoanOriginalAmount and BorrrowerAPR is impacted by categorical variables like Term and Prosper Rating (Alpha).
As a bonus, I will also explore same impact on Loan Original Amount and BorrowerRate
What is the impact of term on Loan Amount and Borrower APR using regplot. (*Bonus: Replace Borrower APR with BorrowerRate and observe if the trend is s
# Impact of Term on Loan Amount and BorrowerAPR relationship
f=sns.FacetGrid(data= sub_prospa, aspect=1.4, height=5, col='Term', col_wrap=2)
f.map(sns.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter=0.04, scatter_kws={'alpha':0.1});
f.add_legend()
plt.xlabel("LoanOriginalAmount",fontsize = 15);
plt.ylabel("BorrowerAPR",fontsize = 15);
plt.title("Correlation Between BorrowerAPR and Loan Original Amount",fontsize = 20)
Text(0.5, 1.0, 'Correlation Between BorrowerAPR and Loan Original Amount')
# Impact of Term on Loan Amount and BorrowerRate relationship
f=sns.FacetGrid(data= sub_prospa, aspect=1.4, height=5, col='Term', col_wrap=2)
f.map(sns.regplot, 'LoanOriginalAmount', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1});
f.add_legend()
plt.xlabel("LoanOriginalAmount",fontsize = 15);
plt.ylabel("BorrowerRate",fontsize = 15);
plt.title("Correlation Between BorrowerRate and Loan Original Amount",fontsize = 20)
Text(0.5, 1.0, 'Correlation Between BorrowerRate and Loan Original Amount')
Generally, there is a negative correlation between LoanOriginalAmount and BorrowerRate for all 3 terms. Similar tren can be observed between LoanOriginalAmount and BorrowerAPR.
How does Prosper Rating affect the relation between Borrower APR (Annual Percentage Rating) and Loan Original Amount
# Convert Employment Status and Prosper Rating into ordered categorical types
p_rating_order = ['AA','A','B','C','D','E','HR']
ordered_var = pd.api.types.CategoricalDtype(ordered = True,
categories = p_rating_order)
sub_prospa['ProsperRating (Alpha)'] = sub_prospa['ProsperRating (Alpha)'].astype(ordered_var)
employmt_order = ['Employed','Self-employed','Full-time','Part-time','Retired','Other','Not employed', 'Not available']
ordered_var = pd.api.types.CategoricalDtype(ordered = True,
categories = employmt_order)
sub_prospa['EmploymentStatus'] = sub_prospa['EmploymentStatus'].astype(ordered_var)
# Impact of Term on Loan Amount and BorrowerAPR relationship
g=sns.FacetGrid(data= sub_prospa, aspect=1.4, height=5, col='ProsperRating (Alpha)', col_wrap=2)
g.map(sns.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter=0.04, scatter_kws={'alpha':0.1});
g.add_legend()
plt.xlabel("LoanOriginalAmount",fontsize = 15);
plt.ylabel("BorrowerAPR",fontsize = 15);
plt.title("Correlation Between BorrowerAPR and Loan Original Amount",fontsize = 20)
Text(0.5, 1.0, 'Correlation Between BorrowerAPR and Loan Original Amount')
The borrower APR and Loan Original Amount have a positive link. However, the relationship becomes negative as the rating drops from AA to HR. I believe that Prosper executives purposefully increased the APR for high-rated customers as the loan amount requested increased in order to maximise returns from the transaction (possibly because these customers have been with them for a long time and they are already loyal to the brand). In contrast, those with lower prosper ratings have lower APRs as the loan amount increases. I think this is being done on purpose to entice new clients—who most likely have low APRs—to try out the service.
How does Prosper Rating affect the relation between Borrower Rate (Interest Rate) and Loan Original Amount
# Impact of Term on Loan Amount and BorrowerAPR relationship
g=sns.FacetGrid(data= sub_prospa, aspect=1.4, height=5, col='ProsperRating (Alpha)', col_wrap=2)
g.map(sns.regplot, 'LoanOriginalAmount', 'BorrowerRate', x_jitter=0.04, scatter_kws={'alpha':0.1});
g.add_legend()
plt.xlabel("LoanOriginalAmount",fontsize = 15);
plt.ylabel("BorrowerRate",fontsize = 15);
ax.set_title("Correlation Between BorrowerRate and Loan Original Amount", y=1.0, pad=-14)
# plt.title("Correlation Between BorrowerRate and Loan Original Amount",fontsize = 20)
# ax.set_title('Manual y', y=1.0, pad=-14)
Text(0.5, 1.0, 'Correlation Between BorrowerRate and Loan Original Amount')
Similar conclusion can be drawn for this relationship between Loan Original Amount and Borrower Rate as in that of Loan Original Amount vs BorrowerAPR above.
Using Seaborn pointplot, can we see how loan term affect the relationship between ProspersRating and BorrowerAPR
#visualise the relationship
fig_size = plt.figure(figsize = [10,6])
ax = sns.pointplot(data = sub_prospa, x = 'ProsperRating (Alpha)', y = 'BorrowerAPR', hue = 'Term',
palette = 'Greens', linestyles = '', dodge = 0.4, ci='sd')
plt.title('Borrower APR across rating and term', fontsize=15)
plt.xlabel('ProsperRating (Alpha)', fontsize=15)
plt.ylabel('Borrower APR', fontsize=15)
ax.set_yticklabels([],minor = True);
Highly rated borrowers (AA-B) have lower APR, though there is an incremental difference as the loan term increases from 12-60. But poorly rated borrowers attract higher APR.
Using Seaborn pointplot, can we see how loan term affect the relationship between Employment Status and BorrowerAPR?
#visualise the relationship
fig_size = plt.figure(figsize = [10,6])
ax = sns.pointplot(data = sub_prospa, x = 'EmploymentStatus', y = 'BorrowerAPR', hue = 'Term',
palette = 'Greens', linestyles = '', dodge = 0.4, ci='sd')
plt.title('Impact of loan term on Employment Status and BorrowerAPR?', fontsize=15)
plt.xlabel('Employment Status', fontsize=15)
plt.ylabel('Borrower APR', fontsize=15)
ax.set_yticklabels([],minor = True);
Highly rated borrowers (AA-B) have lower APR, though there is an incremental difference as the loan term increases from 12-60. But poorly rated borrowers attract higher APR.
Visualise the impact of Term on the relationship between ProsperRating (Alpha) & LoanOriginalAmount; and ProsperRating (Alpha) & StatedMonthlyIncome
fig, ax = plt.subplots(ncols=2, figsize=[15,8])
sns.pointplot(data = sub_prospa, x = 'ProsperRating (Alpha)', y = 'LoanOriginalAmount', hue = 'Term',
palette = 'gnuplot', linestyles = '', dodge = False, ax=ax[0], fontsize=15)
sns.pointplot(data = sub_prospa, x = 'ProsperRating (Alpha)', y ='StatedMonthlyIncome', hue = 'Term',
palette = 'gnuplot2', linestyles = '', dodge = 0.3, ax=ax[1]);
plt.xlabel('ProsperRating (Alpha)', fontsize=15)
plt.ylabel('StatedMonthlyIncome', fontsize=15)
Text(0, 0.5, 'StatedMonthlyIncome')
Borrowers with high monthly income and prosper rating tend to borrow loans of 12 months term duration.
The borrower APR and Loan Original Amount have a positive link. However, the relationship becomes negative as the rating drops from AA to HR.
Further exploration on the influence of loan term and prosper rating on the original loan amount shows that for better rating, the amount increases for all three terms.
Unexpectedly, the borrower APR and loan amount have a negative link when the borrower's Prosper rating is between HR and B, but a positive correlation when the borrower's rating is between A and AA. Another intriguing finding is that for borrowers with HR-C rates, the borrower APR decreases as the borrow time lengthens. However, the APR rises with the length of the loan for those with B-AA credit ratings.